Mixture prior for sparse signals with dependent covariance structure

In this study, we propose an estimation method for normal mean problem that can have unknown sparsity as well as correlations in the signals. Our proposed method first decomposes arbitrary dependent covariance matrix of the observed signals into two parts: common dependence and weakly dependent error terms. By subtracting common dependence, the correlations among the signals are significantly weakened. It is practical for doing this because of the existence of sparsity. Then the sparsity is estimated using an empirical Bayesian method based on the likelihood of the signals with the common dependence removed. Using simulated examples that have moderate to high degrees of sparsity and different dependent structures in the signals, we demonstrate that the performance of our proposed algorithm is favorable compared to the existing method which assumes the signals are independent identically distributed. Furthermore, our approach is applied on the widely used “Hapmap” gene expressions data, and our results are consistent with the findings in other studies.


Introduction
Feature selection in normal mean with high-dimensional data is one of main challenges in Genetics studies. The feature selection in normal mean problem can be described as a pdimensional vector Y satisfying Y = X + Z. In particular, X is a sparse vector, that is, a great portion of elements of X are zeros and the proportion of zeros in X is unknown. Z is a vector of noises and has a distribution as Z � N ð0; ΣÞ, where Σ is the covariance matrix with arbitrary dependent structure. The purpose of this study is to find the desirable estimate of X vector which has the accurate sparsity information while considering correlations in the error vector Z.
Traditionally, there are mainly two branches of studies in feature selection under normal mean issue. One branch of studies focuses primarily on constructing and selecting subset of features that are useful to build a good predictor where Y could be the regression coefficients, e.g. [1][2][3][4]. Another branch of studies uses multiple hypothesis testing in genomics/bio-informatics where test statistics could be considered as Y, such as [5][6][7]. a1111111111 a1111111111 a1111111111 a1111111111 a1111111111 The study [8] was the first study that employed empirical Bayesian estimation method to find the sparsity in sparse vectors observed in Gaussian white noises. They assumed a mixture of point mass at zero and parametric distribution as the prior distribution for the sparse vectors, and inferred theoretical statistical properties for the estimators. [4] extended the work of [8] to allow more flexibility in the estimation by assuming the prior distribution as a mixture of point mass and nonparametric distribution. Nonetheless, both studies assumed that sparse signals in normal mean problem were independent identically distributed.
Another study [7] investigated the problem of controlling false discovery rate (FDR) with dependent structure in test statistics of multiple testing. They used an innovative principle factor approximation (PFA) procedure to significantly weaken the correlation structure. However, their study used a common p-value threshold to select the relevant features which can not be adapted to estimate sparsity in signals.
Our study extends the work of [4,8] to allow sparse signals to be correlated and have arbitrary covariance structure, and adopts the PFA procedure of [7] to estimate sparsity in signals automatically by assuming a mixture prior in an empirical Bayesian estimating setup. The motivation for allowing correlation in sparse signals comes from gene expressions data analyses that need to identify a proportion of genes associated with the outcome. There are usually statistical correlations between genes in gene expression data because of either biological reason (i.e. some genes are connected in biochemical pathways, [9]), or technical reason (i.e. imperfect normalization, [10]). Ignoring the correlation between genes may cause high variability of statistical estimators and even compromise their consistency [6,11]. Furthermore, there exists a sparsity problem: although the number of genes is large, there are maybe only a very small subset of genes that contribute to the outcome, [7].
To find the sparsity in a high-dimensional sparse vector with correlations, we first apply spectral decomposition on the covariance matrix of the error terms and take out the principle factors that derive the strong dependence in the covariance matrix so that the remaining factors have weak dependence. Because of the existence of sparsity, the approximation is accurate [7]. Then, we employ an empirical Bayesian method to estimate the sparsity based on the marginal likelihood of signals without strong dependence. We consider a num-ber of dependent structures in the sparse signals for simulation studies. The results demon-strate the advantages of our empirical Bayesian (DepEB) estimator considering dependence over the empirical Bayesian (EB) estimator proposed by [4] without considering depen-dence when there exists strong dependence in the covariance: 1) Our DepEB estimator pro-vides more accurate estimates of the sparsity than the EB estimator; 2) Our DepEB estimator gives smaller (integrated) mean squared errors (MSEs) than the EB estimator. In a real gene data application, we consider observed signals as the corresponding marginal regression coefficients of the genes on the outcome. There are strong correlations among the marginal coefficients because of the high association between genes. Our proposed DepEB estimator outperforms the EB estimator, and our results are also consistent with other studies.
In sum, our estimating method has two main features. First, our estimating method allows the covariance of the observed high-dimensional signals to have arbitrary dependent structure instead of assuming the signal vector follows an independent and identically distribution. Second, our estimating method incorporates the possibility of sparsity and uses a mixture prior with an atom of probability at zero and a non-parametric density for the nonzero part. We treat the mixture probability as well as the non-parametric part as hyperparameters, and we estimate them by an empirical Bayesian procedure automatically.
The remainder of this paper is organized as follows. Section Materials and Methods explicitly describes our model and estimation algorithm for our proposed estimator. In section Simulation Studies, the performance of the proposed estimator is evaluated by a number of simulation studies. Section Real Data Analysis presents the real data analysis using our proposed estimator. Section Conclusion concludes the paper.

Model
We are interested in estimating measurement error models where we observe only the errorcontaminated variable Y i in the equation: where Z 1 , . . ., Z p are measurement errors with joint distribution as N ð0; ΣÞ. This study allows arbitrary dependent structure in the covariance matrix Σ. X i is assumed to come from a mixture of a delta function with point mass at zero and a completely unspecified nonparametric density θ where δ(X i ) is the Dirac delta function when X i = 0 and ψ 2 [0, 1] describes the prior probability when X i = 0. This prior structure represents our belief that some of X 1 , . . ., X p are exactly zero, i.e. the mixing parameter ψ is the fraction of zeros in X 1 , . . ., X p corresponding to the number of irrelevant features. We treat ψ as a hyperparameter and estimate it using an empirical Bayesian approach. For the nonzero part of the prior θ(X i ), there were a number of parametric distribution prior specifications used by previous studies, such as Laplace distribution in [12], Student-t distribution in [13], and normal distribution in [14]. Although these previous studies have shown that the parametric specifications are successful, they all need to assume a specific shape for the nonzero part. This study takes a more realistic approach by leaving the nonzero part of the prior, θ(X i ), completely unspecified as proposed in [4].

Principal factor approximation
Based on the observation of Y 1 , . . ., Y p in Eq 1, we estimate the values for X 1 , . . ., X p . In the sparse scenario, estimatedX i need to find the correct degree of sparsity in X i . Because covariance matrix Σ of Z 1 , . . ., Z p can have arbitrary dependent structure, the first step of estimation is to remove dependence among the error terms Z i . Therefore, we approximate the likelihood function of signals Y 1 , . . ., Y p with weakly dependent normal random variables as proposed in [7].
The definition of weakly dependent normal random variables is stated as Definition 1 If a set of random variables (I 1 , . . ., I p ) T has the normal distribution N ððm 1 ; . . . ; m p Þ T ; CÞ and the (i, j)th element c ij in the covariance matrix C satisfying the condi- then I 1 , . . ., I p are called weakly dependent normal random variables. We apply principal factor approximation (PFA) technique in [7] to decompose the covariance matrix Σ so Eq 1 becomes a factor model with weakly dependent normal random errors. Before PFA procedure, the estimate of Σ need to be calculated since in real data set, true Σ is usually unknown. We propose to use sample covariance of the errors as the estimate for Σ.
The details of PFA procedure are described in the following steps: 1. Apply spectral decomposition on the covariance matrix Σ. Denote eigenvalues of Σ as λ 1 , . . ., λ p , which are arranged from the largest to the smallest. Denote the corresponding eigenvectors as γ 1 , . . ., γ p , then Σ ¼ i and k is chosen as the smallest k such that ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi l 2 kþ1 þ ::: þ l 2 for a predetermined small value �, say 0.05. Let L ¼ ð ffi ffi ffi ffi ffi Then the covariance matrix Σ can be decomposed as Σ = LL T + A.
. . . ; s because the existence of sparsity makes this approximation practical. c W 1 ; :::; c W k are obtained by least-absolute deviation regression on the approximation equation.
After applying PFA procedures on the covariance matrix Σ, we can re-write the normal mean problem with dependent error terms in Eq 1 as then Eq 5 can be rewritten as where K i 's are weakly dependent normal variables with distribution as N ð0; a 2 i Þ and a i ¼ ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi

Empirical bayesian estimation
Here, we estimate the hyperparameterĉ in Eq 2 using empirical Bayesian method. Posterior distribution. We first derive the posterior distribution based on the likelihood function Pð� i jX i Þ and the prior distribution of X i , for i = 1, . . ., p.
From Eq 6 we have Pð� i jX i Þ ¼ N ð� i jX i ; a i Þ. Since K i 's are weakly dependent, the likelihood of the parameters ϕ given the observations X can be approximated as Given the prior distribution of X in Eq 2, i.e. PðX i jc; yÞ and likelihood PðϕjXÞ, the posterior of X given ϕ can be written as where m(ϕ|ψ, θ) is the marginal of the data given the hyper-parameter ψ and the nonparametric distribution θ Replace Pð� i jX i Þ by N ð� i jX i ; a i Þ and substitute PðX i jc; yÞ by Eq 2, then Eq (9) can be rewritten as That is where Then the posterior of X i can be written as: then Notice that q i is the posterior density of X i when it is 0 and G(X i ) is the posterior density of X i when it is not 0. That is PðX i j� i ; c; y; X i 6 ¼ 0Þ ¼ GðX i Þ: Estimate the hyperparameter ψ. We use an empirical Bayesian approach in [15] to estimate the hyperparameter ψ by maximizing the marginal likelihood m(ϕ|ψ, θ) Notice proposed by [4] to estimate g(ϕ i ) directly. In this way, we do not have to specify any prior distribution form for the non-zero part of X i 's. We use a weighted non-parametric kernel density estimator in [4] to estimate g of the following form where R is a kernel function satisfying R R(x)dx = 1 and ν is a positive number called bandwidth of the kernel. The most widely used kernel is a normal density with zero mean and unit variance, that is, RðxÞ ¼ N ðxj0; 1Þ. Therefore, we set the bandwidth of the kernel using the normal reference rule n ¼ Oðp À 1=5 Þ as in [16].
In summary, the hyperparameter ψ can be estimated by the following iterative steps Algorithm 1. Given the current estimateĝð� i Þ, obtainĉ by maximizing the log-marginal in Eq (12) numerically.

2.
Computeq i using the current estimatedĉ andĝð� i Þ 3. Re-estimateĝð� i Þ using the current estimate ofq î Posterior mean. We use the mean of the posterior as a point estimate for non-zero part of X i 's.X where and the estimate of marginalĝ ð� i Þ is obtained using Eq 13 and the estimate of derivativeĝ 0 ð� i Þ is given byĝ

Simulation studies
In this section, we conduct simulation studies to evaluate our proposed estimation procedure and make comparisons between our method with the one in [4]. We simulate the data for Eq 1. A sequence of p = 1000 signals Y i 's are generated with 200 repetitions in the simulation study.
We generate X i using a type of non-zero distribution with different degree of sparsity. Two different types of distribution are considered for the non-zero values in X i 's: 1) Uninormal and 2) Binormal. The detailed descriptions for the distribution of X i 's used in the simulation studies are summarized in Table 1. We let X i equal to zero at randomly selected positions and the proportion of zeros in X i ' ranges from 0.6 to 0.9 with interval of 0.1, which represents the situation from low sparsity to high sparsity. Because the existence of sparsity makes the PFA approximation practical as specified in Eq 5, we consider relatively high values of sparsity parameter ψ in the simulation studies.
The observed value of Y i is generated by adding noise Z i for each X i based on Eq 1. Let Z � N ð0; ΣÞ. Our simulation studies consider six different dependent structures in the covariance of noises Z. The dependent structures of Σ are similar to the settings used in [7], and they are generated using random variables fB i g p i¼1 with a sample size of n = 100. The covariance matrix Σ is the sample correlation matrix of B 1 , � � �, B p .

<
: Binormal (Bimodal Normal)  Notice that the correlation structures of the error terms in these four models are very close to independent structure. In these cases, the results obtained by using our proposed DepEB estimation algorithm (in red color) and the EB estimator (in gray color) both can give estimates close to the true values of ψ and give small standard deviations of the estimates.
The last two sub-plots in panels (a) of Figs 2 and 3 show the estimated results when the error terms are generated using Equal Correlation of 0.4 and 0.8 structures, respectively. As shown in the heatmap Fig 1, these two types of dependence structure have much stronger dependence among error terms than the previous four dependence structures. It can be seen that our proposed DepEB estimators give quite accurate estimates and small standard deviations (in red color), which indicates our estimation algorithm can adapt quite well to high correlation in error terms. The distribution of error terms is clearly mis-specified if ignoring the dependence structure. Therefore, the EB estimators tend to underestimate ψ and give large standard deviations (in gray color), which indicates that the proportion of nonzeros would be overestimated and existence of big bias.
To evaluate the accuracy of the estimation for the non-zero part of X, we calculate mean squared error MSE ¼ 1=p The following findings can be seen from panels (b) in Figs 2 and 3: 1) MSEs estimated by our DepEB algorithm (in red color) are always lower than those estimated by the EB algorithm (in gray color). When correlations in error terms are very strong (e.g. Equal Correlation of 0.8 model), MSEs estimated by our proposed DepEB algorithm are much lower than those estimated by the EB algorithm; 2) As the signal becomes more sparse, the MSEs estimated by both the DepEB algorithm and EB algorithm become lower; 3) Similar to the findings in standard

Real data analysis
We use gene expression data set to test the validity of our proposed DepEB estimation algorithm. The gene expression data used in this study are for 90 unrelated Asians from the international "HapMap" project [17], which include 45 Japanese in Tokyo, Janpan (JPT) and 45 Han Chinese in Beijing, China (CHB). The gene expression data were generated by an

Distribution of marginal gene effects
We consider a outcome of interest z and gene data stored in a n × p matrix S. The gene data represent p gene expressions for n individuals. More specifically, element S ij in S represents the j th gene of the i th individual. With the data, we perform marginal linear regression between the outcome variable z and gene S j Let α j and β j be the solutions to Eq 15, andb j be the least squares estimators for β j for j = 1, . . ., We assume the sample correlation between S j and S k isr jk , the sample standard deviation of S j isŝ jj , the conditional distribution of z given S 1 , . . ., S p is N ðmðS 1 ; . . . ; S p Þ; s 2 Þ. The covariance of any two least squares estimatorsb j andb k is Furthermore, since β j is the solution to Eq 15, b j ¼ EððS T j S j Þ À 1 S T j zÞ ¼ Eðb j Þ. Therefore, the joint distribution of least squares estimatorsb 1 ; . . . ;b p is N ððb 1 ; . . . ; b p Þ T ; Σ * Þ, where the (j, k) th element of covariance matrix Σ* is Σ * jk ¼ s 2r jk =nŝ jjŝkk . Next, we standardizeb 1 ; . . . ;b p using their standard deviations as suggested in [7]: Then, the joint distribution of standardized estimatorsÛ 1 ; . . . ;Û p is N ððm 1 ; . . . ; m p Þ T ; ΣÞ, where m j ¼ ffi ffi ffi n p b jŝjj =s for j = 1, . . ., p and the (j, k) th element of covariance matrix Σ is Σ jk ¼r jk . It is obvious that if S 1 , . . ., S p are correlated, thenÛ 1 ; . . . ;Û p are also correlated.
We can rewrite the standardized marginal regression estimatorsÛ j as a model of true marginal regression coefficients and error terms of the marginal regression coefficients as the form in Eq 1Û where the true marginal genetic effects μ j are assumed to follow a mixture distribution of a point mass at 0 (corresponding to no effect) and a nonparametric distribution (corresponding to having effects). The error terms of the standardized marginal regression coefficients υ 1 , . . ., υ p follow a distribution of N ð0; ΣÞ. The covariance matrix Σ can have arbitrary dependent structure. We use empirical estimatesŝ 2 ,r jk andŝ jj to calculate the standardized estimators and Σ.

Gene expression outcome and highly dependent covariance structure
From the gene expression data, we take the gene expressions of CHRNA6 (cholinergic receptor, nicotinic, alpha 6) as the outcome variable (z) in Eq 15. Because the gene CHRNA6 is known to be related to activation of dopamine releasing neurons with nicotine [21], it is a widely studied subject in nicotine addiction studies.
In the remaining gene expressions, we use the method in [19] to first calculate correlations between a gene and all other genes, then count the number of correlations that are greater than 0.6 for this gene. If the count of >0.6 correlations for a gene is greater than 2, 200, we consider this gene as highly correlated with other genes. Therefore, there are 2, 269 genes satisfying this criterion. We take each of the 2, 269 genes as S j in Eq 15, and store the gene data in the n × p matrix S, where n = 90, p = 2, 269. Fig 4 illustrates the heatmap of the absolute values of correlation among the 2, 269 genes.
Next, we regress CHRNA6 (z) on each of the 2, 269 gene expressions (S j ). We use empirical estimatesŝ 2 ,r jk andŝ jj to calculate the standardized least squares regression coefficients (Û j ), which have the distribution as shown in Eq 17. Lastly, using our proposed empirical Bayesian method (DepEB) described in Section, we obtain weakly dependent estimates and estimate the sparsity parameter and posterior distribution of the genes' true marginal effect on CHRNA6. As a comparison, we also apply the EB estimation procedure proposed by [4] on the highly correlated gene data set.

Analysis results
Our proposed DepEB estimation procedure considering dependence can identify 14 genes (0.617% of the studied 2, 269 genes) to be associated with CHRNA6, while the EB estimation procedure without considering dependence can not identify genes related to CHRNA6. Table 2 lists the gene names and the values of the posterior mean of the genes' effects estimated by the DepEB procedure on CHRNA6.
Our findings are consistent with the findings in [19,20], both of which also used the 90 JPT-CHB population's gene data to identify significant genes associated with CHRNA6. Our proposed DepEB procedure selects genes GI_14249217-S, GI_19923528-S, GI_32189368-S, GI_32261328-S, GI_42659728-S, GI_4506330-S and GI_9256536-S, which are also among genes identified by [19] to be related to CHRNA6. In particular, [19] found that gene GI_42659728-S is very likely and gene GI_4506330-S is extremely likely to be related to the outcome CHRNA6. In addition, gene GI_32189368-S, i.e. POLE2, Homo sapiens polymerase (DNA directed), epsilon 2 (p59 subunit), was also discovered by [20] to be related to CHRNA6.

Conclusion
In feature selection of normal mean problem, we propose a new DepEB estimator which extends the empirical Bayesian (EB) estimator method proposed by [4]. Our estimator allows arbitrary dependent structure in the error terms. We first apply eigenvalue decomposition to decompose correlated signals to common dependence and weakly dependent random errors. Next, we subtract the common dependence from the signals. We then get approximate likelihood of the sparse signals using the weakly dependent errors. The existence of sparsity makes the approximation feasible. Finally, we use an iterative maximization algorithm based on nonparametric kernel density to find the sparsity and the posterior distribution of the signals in a Bayesian estimating model.  Table 2. Genes associated with CHRNA6 estimated by the DepEB procedure.

Gene Name Posterior Mean of the Gene's effect on CHRNA6
In our simulation studies, we consider a number of covariance structures in relatively high sparse signals. The simulation results illustrate that when there exists moderate or strong correlations in the signals, our DepEB estimation procedure can correctly find the sparsity of the signals and produce a lower MSE compared to that in the EB estimation procedure ignoring the correlation structure in the sparse signals. Furthermore, we apply our proposed estimation procedure in the 90 JPT-CHB population's gene expression data set with a highly dependent covariance structure. Our proposed DepEB estimation method outperforms the EB estimation method and identifies genes that are associated with the outcome gene CHRNA6. The findings in the real data analysis are consistent with those in other studies. Hence, this study provides an useful application for feature selection when facing strongly dependent covariates.
In the real data analysis of this study, we use marginal regression coefficients as the observed signals for the sparse vector. The reason of using marginal information is to deal with high dimensionality, see studies in [7,22]. Therefore, one possible direction for future research is to use the appropriate initial estimators from the full regression model as the observed signals for sparse vector.